feat(api): accept model overrides on session creation and add runtime model switching endpoints by dgageot · Pull Request #2791 · docker/docker-agent

dgageot · 2026-05-13T14:01:51Z

What changes

`POST /api/sessions` honours model selection (#2780)

The handler binds the request body to a session.Session so the existing
agent_model_overrides and custom_models_used JSON fields are now picked
up and copied onto the new session in SessionManager.CreateSession. The
runtime applies them the first time it is built for the session (see
runtimeForSession), so a freshly created session can be pinned to a
specific model without an extra round-trip.

Runtime model switching over HTTP (#2781)

Two new endpoints:

GET /api/sessions/:id/models → list of runtime.ModelChoice for the
session's current agent, with IsCurrent set on the active selection
(override or default), and any inline provider/model ref from the
session's history (or override) appended as a synthetic choice.
PATCH /api/sessions/:id/model → applies a per-agent override and
persists it. Empty model clears the override and reverts to the
agent's configured default.

POST /api/sessions/:id/model is also accepted because the existing
pkg/runtime Client.SetAgentModel (used by RemoteRuntime) was
historically a POST — so no client upgrade is required.

Status code mapping:

outcome	code
ok	200
invalid JSON body	400
no runtime attached for session	404
runtime doesn't support model switching	422
invalid model ref	400

Implementation notes

Reuses runtime.ModelChoice directly as the wire type (added JSON tags).
No duplicate pkg/api.ModelChoice.
The TUI App.AvailableModels now delegates to a shared
runtime.DecorateModelChoices helper, which is also used by
SessionManager.AvailableSessionModels. HTTP and TUI clients see the
same picker state.
SessionManager.SetSessionAgentModel snapshots state before mutation
and rolls back both the in-memory session and the runtime override
if sessionStore.UpdateSession fails.
New sentinel ErrSessionNotRunning distinguishes "session not found
or not running" from generic errors so the HTTP layer can map it to
404 deterministically.
runtimeForSession re-applies persisted overrides via a small
applyStoredOverrides helper. Failures are logged at WARN — a stored
override that no longer resolves must not prevent session resumption.

Tests

pkg/server/session_models_test.go (new) — end-to-end coverage for
GET/PATCH/POST, IsCurrent decoration, custom models appended, empty
body clears override, store-write rollback, runtime-failure rollback,
404 when no runtime attached, 422 when runtime doesn't support
switching, POST verb compat.
pkg/runtime/model_switcher_test.go — table test for
DecorateModelChoices corner cases.
pkg/server/session_manager_test.go — SupportsModelSwitching on
fakeRuntime.

Validation

task lint — clean.
go test -race -count=3 ./pkg/server/... ./pkg/runtime/... ./pkg/api/... — green.
Full repo go test — green except pre-existing environmental failures
in pkg/teamloader (require Docker Model Runner running locally,
unrelated to this PR).

Commits

feat: support model overrides when creating sessions
feat: add GET/PATCH/POST endpoints for runtime model switching
fix: add backward compat note for POST /sessions/:id/model
fix: thread-safe model reads and persistence rollback
test: enhance model selection test coverage and fix goroutine leaks
refactor(runtime): extract DecorateModelChoices and remove duplication
fix: distinguish "session not running" from other errors

dgageot · 2026-05-13T14:19:17Z

Thanks for the review! Addressed all three findings in 3 separate commits:

[MEDIUM] `DecorateModelChoices` mutates caller-supplied slice — fixed in `96ea8bb`

The function now allocates a result slice (sized len(models)+len(customRefs)+1 to fit the synthetic-override case), copies the input, and operates on the copy. Doc-comment updated to spell out the non-mutation guarantee. Added TestDecorateModelChoices/does_not_mutate_the_input_slice that snapshots the input before calling and asserts deep-equality after.

[MEDIUM] Unknown errors mapped to 400 — fixed in `ad1366a`

The handler's default branch now returns 500 (mirroring the companion getSessionModels handler). 400 is reserved for the upstream Bind failure on the request body. Added TestAttachedServer_SetSessionModel_StoreFailureReturns500 driving a synthetic UpdateSession failure through the HTTP layer and asserting 500.

[HIGH] Lock held across blocking runtime I/O — fixed in `be9e8ae`

Added a per-session modelSwitch sync.Mutex on activeRuntimes to serialise concurrent SetSessionAgentModel calls on the same session without blocking other sessions.
AvailableSessionModels now snapshots current and customRefs under sm.mux briefly, releases the lock, then calls runtime.AvailableModels(ctx).
SetSessionAgentModel is restructured so sm.mux is taken in three short critical sections (snapshot prev → mutate session → rollback in-memory) but released around the slow runtime calls (SetAgentModel apply + rollback) and the store write. Atomicity wrt other model-switches on the same session is preserved by the new modelSwitch lock; the rollback contract is unchanged.
Added two synchronisation-style tests that park a deliberately-slow runtime call open and verify that an unrelated SetSessionStarred (sm.mux-acquiring) completes within 2s while the runtime call is still in flight:
- TestSessionManager_AvailableSessionModels_DoesNotHoldMuxDuringRuntimeIO
- TestSessionManager_SetSessionAgentModel_DoesNotHoldMuxDuringRuntimeIO

Validation

task lint — clean.
go test -race -count=5 ./pkg/server/... — green.
go test -race -count=3 ./pkg/server/... ./pkg/runtime/... — green.

Allow callers to specify agent_model_overrides and custom_models_used when creating sessions via POST /api/sessions, and wire the runtime with model switcher configuration to apply those overrides when the session is first run.

Add GET /api/sessions/:id/models to list available models and the current override (if any), and PATCH/POST /api/sessions/:id/model to apply a model override for the session's current agent. Extend ModelChoice with JSON tags for wire format and add SessionModelsResponse type. Both endpoints return 422 if the runtime doesn't support model switching.

PATCH is the canonical verb for updating the agent's model. POST is also accepted because pkg/runtime Client.SetAgentModel (used by RemoteRuntime) was historically a POST; keep both so a client upgrade is not required.

Fix a data race in AvailableSessionModels where rs.session is read without holding sm.mux (the lock protecting reads/writes of AgentModelOverrides and CustomModelsUsed by SetSessionAgentModel). In SetSessionAgentModel, snapshot the in-memory override state before mutating the runtime and session. If sessionStore.UpdateSession fails after the runtime mutation, roll back both the in-memory session state and the runtime override so callers don't observe a runtime/store mismatch on the next request. Fixes issues identified in review and validated with go test -race.

Extract a startAttachedServer helper that wires up a SessionManager + HTTP server with t.Cleanup(ln.Close) so the Serve goroutine exits cleanly when the test finishes. Replaces boilerplate in all model-switching tests and prevents goroutine leaks. Add tests for key picker scenarios: - TestAttachedServer_GetSessionModels_DefaultMarkedCurrent: default marked current when no override is active - TestAttachedServer_GetSessionModels_AppendsCustomModels: session's custom model history appears in the picker - TestSessionManager_SetSessionAgentModel_RollsBackOnStoreFailure: store failure rolls back both in-memory session and runtime state - TestDecorateModelChoices (table): corner cases for the decorate helper Update TestAttachedServer_GetSessionModels to verify the IsCurrent decoration flow end-to-end and to remove the manual IsCurrent: true from the fixture (now correctly set by the manager).

Map ErrSessionNotRunning to 404 so callers can tell apart a stale session ID from a server-side problem. Introduce the sentinel error and replace four ad-hoc errors.New() calls with it: SteerSession, FollowUpSession, AvailableSessionModels, SetSessionAgentModel. Update the HTTP handlers to check for this sentinel explicitly and return 404 accordingly. Also add TestAttachedServer_ModelEndpoints_404WhenNotRunning to verify both GET /sessions/:id/models and PATCH /sessions/:id/model return 404 when no runtime is attached.

Add GET/PATCH/POST /api/sessions/:id/model(s) to the endpoint table. POST is accepted for backward compatibility with RemoteRuntime's historical SetAgentModel implementation. Assisted-By: docker-agent

dgageot · 2026-05-18T08:38:38Z

Addressed the blocking docs feedback from @aheritier in commit da48ca8:

Added all three model-switching endpoints to the API server documentation:

GET /api/sessions/:id/models — List available models for the session's current agent
PATCH /api/sessions/:id/model — Set or clear the agent's model override
POST /api/sessions/:id/model — Set or clear the agent's model override (backward compat with RemoteRuntime)

The endpoints are now documented in docs/features/api-server/index.md in the Sessions table.

All commits pass task lint, task test, and task build. Branch has been rebased on origin/main and force-pushed.

dgageot · 2026-05-18T08:42:05Z

/review

docker-agent

Assessment: 🟡 NEEDS ATTENTION

One concurrency consistency issue was found in the new model-switching logic. No data corruption or panics — the impact is limited to a transient stale read from during a store-write failure + rollback scenario.

Reviewed files (risk-score-weighted):

pkg/server/session_manager.go (risk: 6) — thread-safety of SetSessionAgentModel/AvailableSessionModels ✅ (one issue found, see inline)
pkg/server/server.go (risk: 1) — HTTP status code mapping ✅ correct
pkg/runtime/model_switcher.go — DecorateModelChoices logic ✅ correct
pkg/server/session_models_test.go — test coverage ✅ comprehensive

docker-agent · 2026-05-18T08:45:42Z

+	}
+	sm.mux.Unlock()
+
+	if err := sm.sessionStore.UpdateSession(ctx, sess); err != nil {


[MEDIUM] Stale model override briefly visible to AvailableSessionModels during store-write window

Between the sm.mux.Unlock() on line 931 (after mutating the in-memory session) and the sm.mux.Lock() inside the rollback block, UpdateSession is called without holding sm.mux. Because this is an I/O call, it can take arbitrarily long. During this window, a concurrent GET /api/sessions/:id/models request calls AvailableSessionModels, which acquires sm.mux and reads rs.session.AgentModelOverrides[agentName] — observing the not-yet-persisted (and about-to-be-rolled-back) override.

The code comment says "Roll back in-memory under sm.mux first so concurrent readers never see the half-applied state", but that guarantee doesn't hold: sm.mux is released before the rollback lock is re-acquired, so there is a real window.

Impact: GET /models can transiently report a model override that was ultimately never persisted (because the store write failed). The in-memory state is corrected by the rollback, but any client that polled the endpoint during the window would see an incorrect IsCurrent marker.

Suggested fix: Keep the session mutation and the store write under a single critical section, or (if the I/O overhead is unacceptable) hold the modelSwitch lock long enough that the mutation + store write + any rollback are treated as an atomic unit visible only after the store confirms success. Alternatively, mutate the in-memory session after the store write succeeds, never before.

The previous implementation mutated the in-memory session before persisting it to the store. Between the mutation and the store write, concurrent calls to AvailableSessionModels could acquire sm.mux and observe the not-yet- persisted override. If the store write then failed, the rollback would correct the in-memory state, but any concurrent reader that polled during the window would have seen an incorrect IsCurrent marker. This commit restructures SetSessionAgentModel to: 1. Clone the session 2. Apply mutations to the clone 3. Persist the clone to the store 4. Only after persistence succeeds, apply the mutations to the live session under sm.mux Now concurrent readers never observe a state that hasn't been successfully persisted. The rollback path is simplified: if the store write fails, the live session is unchanged and we only need to roll back the runtime. Addresses the concurrency consistency issue identified in the latest docker-agent review (2026-05-18). Assisted-By: docker-agent

aheritier

All Blocking items from prior review rounds are resolved and verified by new tests; CI is fully green. Approving. Items below are non-blocking polish — feel free to address in a follow-up or defer.

✅ Resolved and verified:

DecorateModelChoices no longer mutates input; defensive-copy test added.
setSessionModel default branch returns 500, matching getSessionModels; store-failure test added.
sm.mux no longer held across runtime calls; per-session modelSwitch lock + contention tests added.
Three new endpoints documented in docs/features/api-server/index.md.

Nice work on the app.go refactor — 67 lines → 5-line delegation with full helper coverage.

aheritier · 2026-05-18T10:17:48Z

+	// Clone the session for the store write. We'll apply mutations to the
+	// clone, persist it, and only then update the live session. This ensures
+	// concurrent readers never observe a not-yet-persisted state.
+	updatedSess := &session.Session{
+		ID:                      sess.ID,
+		Title:                   sess.Title,
+		CreatedAt:               sess.CreatedAt,
+		WorkingDir:              sess.WorkingDir,
+		ToolsApproved:           sess.ToolsApproved,
+		Permissions:             sess.Permissions,
+		MaxIterations:           sess.MaxIterations,
+		MaxConsecutiveToolCalls: sess.MaxConsecutiveToolCalls,
+		MaxOldToolCallTokens:    sess.MaxOldToolCallTokens,
+		InputTokens:             sess.InputTokens,
+		OutputTokens:            sess.OutputTokens,
+		Cost:                    sess.Cost,
+		Starred:                 sess.Starred,
+	}
+
+	// Clone the maps/slices under sm.mux to avoid data races
+	sm.mux.Lock()
+	if sess.AgentModelOverrides != nil {
+		updatedSess.AgentModelOverrides = maps.Clone(sess.AgentModelOverrides)
+	}
+	if len(sess.CustomModelsUsed) > 0 {
+		updatedSess.CustomModelsUsed = append([]string(nil), sess.CustomModelsUsed...)
+	}
+	sm.mux.Unlock()
+
+	// Apply the mutations to the cloned session
+	var appendedCustomUsed bool
+	if modelRef == "" {
+		delete(updatedSess.AgentModelOverrides, agentName)
+	} else {
+		if updatedSess.AgentModelOverrides == nil {
+			updatedSess.AgentModelOverrides = make(map[string]string)
+		}
+		updatedSess.AgentModelOverrides[agentName] = modelRef
+
+		// Track inline provider/model references so they remain easy to
+		// re-select via the model picker (mirrors App.SetCurrentAgentModel).
+		if strings.Contains(modelRef, "/") && !slices.Contains(updatedSess.CustomModelsUsed, modelRef) {
+			updatedSess.CustomModelsUsed = append(updatedSess.CustomModelsUsed, modelRef)
+			appendedCustomUsed = true
+		}
+	}
+
+	// Persist the cloned session. If this fails, the live session is
+	// unchanged and we only need to roll back the runtime.
+	if err := sm.sessionStore.UpdateSession(ctx, updatedSess); err != nil {
+		rollback := prevOverride
+		if !hadOverride {
+			rollback = ""
+		}
+		if rbErr := rs.runtime.SetAgentModel(ctx, agentName, rollback); rbErr != nil {
+			slog.ErrorContext(ctx, "Failed to roll back runtime model override", "session_id", sessionID, "agent", agentName, "error", rbErr)
+		}
+		return "", "", fmt.Errorf("failed to persist model override: %w", err)
+	}


Non-blocking — Nice fix on the clone-then-persist pattern — this kills the 'picker shows new override that gets rolled back' window. There's a smaller residual window between runtime.SetAgentModel succeeding and the post-UpdateSession sm.mux.Lock(): during that gap the runtime is on the new model but AvailableSessionModels reads the old override from rs.session, so GET /models reports IsCurrent on the previous model while requests already execute against the new one. Strictly less bad than what was fixed (briefly stale, never fictitious). Closing it would mean holding modelSwitch around the GET path — non-trivial. A short doc-comment acknowledging the residual window or a follow-up ticket would be enough. Not a merge blocker.

aheritier · 2026-05-18T10:17:50Z

+	var (
+		hadOverride     bool
+		prevOverride    string
+		hadOverridesMap bool


Nit — hadOverridesMap is only used as the if condition immediately below — inline the check and drop the variable. golangci-lint misses it because the assignment makes it look 'used'.

aheritier · 2026-05-18T10:17:51Z

+// SessionModelsResponse is the response returned by
+// GET /api/sessions/:id/models. CurrentModelRef is the active override for
+// the named agent (empty when the agent is using its configured default).
+type SessionModelsResponse struct {
+	Agent           string        `json:"agent"`
+	CurrentModelRef string        `json:"current_model_ref,omitempty"`
+	Models          []ModelChoice `json:"models"`
+}


Question — SessionModelsResponse is a brand-new wire type but lives in pkg/runtime next to ModelChoice. The companion SetSessionModelRequest/SetSessionModelResponse sit in pkg/api. The PR description justifies keeping ModelChoice in pkg/runtime because it pre-existed there — but SessionModelsResponse is new code introduced specifically as the wire envelope. Either move it to pkg/api (consistent with the other request/response types) or move all three to pkg/runtime and explain in pkg/api/types.go why model types are an exception. Mixed placement makes pkg/api incomplete as a 'what's on the wire' reference.

aheritier · 2026-05-18T10:17:53Z

+		if !hadOverride {
+			rollback = ""
+		}
+		if rbErr := rs.runtime.SetAgentModel(ctx, agentName, rollback); rbErr != nil {


Non-blocking — Already raised by @aheritier in round 2 and not addressed in head. Rollback still uses the request ctx — if the original request was cancelled and that's what caused UpdateSession to fail, the rollback inherits the cancellation. context.WithoutCancel(ctx) (or a fresh background ctx with a short timeout) would protect the recovery path. One-line change with a clear correctness argument; worth fixing now.

dgageot requested a review from a team as a code owner May 13, 2026 14:01

trungutt previously approved these changes May 13, 2026

View reviewed changes

This comment was marked as outdated.

Sign in to view

dgageot dismissed trungutt’s stale review via be9e8ae May 13, 2026 14:19

aheritier reviewed May 17, 2026

View reviewed changes

dgageot added 11 commits May 18, 2026 10:24

feat: support model overrides when creating sessions

2309007

Allow callers to specify agent_model_overrides and custom_models_used when creating sessions via POST /api/sessions, and wire the runtime with model switcher configuration to apply those overrides when the session is first run.

fix: add backward compat note for POST /sessions/:id/model

3ed8b94

PATCH is the canonical verb for updating the agent's model. POST is also accepted because pkg/runtime Client.SetAgentModel (used by RemoteRuntime) was historically a POST; keep both so a client upgrade is not required.

refactor(runtime): extract DecorateModelChoices and remove duplication

e38635f

fix(runtime): make DecorateModelChoices defensively copy input slice

90c7b4c

fix(server): map unknown errors from setSessionModel to 500, not 400

baf326c

fix(server): release sm.mux during runtime I/O on model endpoints

49d1b09

docs(api-server): document model switching endpoints

da48ca8

Add GET/PATCH/POST /api/sessions/:id/model(s) to the endpoint table. POST is accepted for backward compatibility with RemoteRuntime's historical SetAgentModel implementation. Assisted-By: docker-agent

dgageot force-pushed the board/54bbc2d7029b6f59 branch from be9e8ae to da48ca8 Compare May 18, 2026 08:38

docker-agent reviewed May 18, 2026

View reviewed changes

aheritier approved these changes May 18, 2026

View reviewed changes

Conversation

dgageot commented May 13, 2026

What changes

POST /api/sessions honours model selection (#2780)

Runtime model switching over HTTP (#2781)

Implementation notes

Tests

Validation

Commits

Uh oh!

This comment was marked as outdated.

Uh oh!

dgageot commented May 13, 2026

[MEDIUM] DecorateModelChoices mutates caller-supplied slice — fixed in 96ea8bb

[MEDIUM] Unknown errors mapped to 400 — fixed in ad1366a

[HIGH] Lock held across blocking runtime I/O — fixed in be9e8ae

Validation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dgageot commented May 18, 2026

Uh oh!

dgageot commented May 18, 2026

Uh oh!

docker-agent left a comment

Choose a reason for hiding this comment

Assessment: 🟡 NEEDS ATTENTION

Uh oh!

docker-agent May 18, 2026

Choose a reason for hiding this comment

Uh oh!

aheritier left a comment

Choose a reason for hiding this comment

Uh oh!

aheritier May 18, 2026

Choose a reason for hiding this comment

Uh oh!

aheritier May 18, 2026

Choose a reason for hiding this comment

Uh oh!

aheritier May 18, 2026

Choose a reason for hiding this comment

Uh oh!

aheritier May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

`POST /api/sessions` honours model selection (#2780)

[MEDIUM] `DecorateModelChoices` mutates caller-supplied slice — fixed in `96ea8bb`

[MEDIUM] Unknown errors mapped to 400 — fixed in `ad1366a`

[HIGH] Lock held across blocking runtime I/O — fixed in `be9e8ae`